Results 1 - 20 of 107
1.
Bioinformatics ; 2024 May 06.
Article in English | MEDLINE | ID: mdl-38710497

ABSTRACT

MOTIVATION: Molecular property prediction (MPP) is a fundamental but challenging task in the computer-aided drug discovery process. A growing number of recent works employ graph-based models for MPP and have made considerable progress in improving prediction performance. However, current models often ignore relationships between molecules, which could also be helpful for MPP. RESULTS: To this end, we propose a graph structure learning (GSL) based MPP approach, called GSL-MPP. Specifically, we first apply a graph neural network (GNN) over molecular graphs to extract molecular representations. Then, with molecular fingerprints, we construct a molecular similarity graph (MSG). Following that, we conduct graph structure learning on the MSG, i.e., molecule-level graph structure learning, to get the final molecular embeddings, which fuse the GNN-encoded molecular representations with the relationships among molecules, thereby combining intra-molecule and inter-molecule information. Finally, we use these molecular embeddings to perform MPP. Extensive experiments on ten benchmark datasets show that our method achieves state-of-the-art performance in most cases, especially on classification tasks. Further visualization studies also demonstrate the quality of the molecular representations learned by our method. AVAILABILITY: Source code is available at https://github.com/zby961104/GSL-MPP.
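
The molecular similarity graph step in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are given as sets of on-bit indices, that similarity is measured with the Tanimoto coefficient, and that an edge is kept when similarity reaches a fixed threshold. The abstract does not state which similarity measure or edge rule GSL-MPP uses, so these choices, along with the names `tanimoto` and `similarity_graph`, are illustrative assumptions only.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_graph(fps: dict, threshold: float = 0.5) -> dict:
    """Build an undirected weighted graph as an adjacency dict.

    An edge u-v is kept when tanimoto(u, v) >= threshold (assumed edge rule).
    """
    graph = {name: {} for name in fps}
    for u, v in combinations(fps, 2):
        s = tanimoto(fps[u], fps[v])
        if s >= threshold:
            graph[u][v] = s
            graph[v][u] = s
    return graph


# Toy fingerprints (hypothetical molecules, not real data).
fingerprints = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {2, 3, 4, 5},
    "mol_c": {10, 11},
}
msg = similarity_graph(fingerprints, threshold=0.5)
print(msg)
```

In this toy input, `mol_a` and `mol_b` share three of five distinct bits and are linked, while `mol_c` stays isolated. A learned graph structure, as described in the abstract, would then refine such an initial graph.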

2.
J Chem Inf Model ; 64(7): 2921-2930, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-38145387

ABSTRACT

Self-supervised pretrained models are becoming increasingly popular in AI-aided drug discovery, leading to a growing number of pretrained models that promise to extract better feature representations for molecules. Yet the quality of the learned representations has not been fully explored. In this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold Hopping (SH) in traditional Quantitative Structure-Activity Relationship analysis, we propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of the representations extracted by a pretrained model and to visualize the relationship between the representations and properties. The concepts of ACs and SH are generalized from the structure-activity context to the representation-property context, and the underlying principles of RePRA are analyzed theoretically. Two scores are designed to measure the generalized ACs and SH detected by RePRA, so that the quality of representations can be evaluated. In experiments, representations of molecules from 10 target tasks generated by 7 pretrained models are analyzed. The results indicate that state-of-the-art pretrained models can overcome some shortcomings of canonical Extended-Connectivity FingerPrints, while the correlation between the basis of the representation space and specific molecular substructures is not explicit. Thus, some representations could be even worse than the canonical fingerprints. Our method enables researchers to evaluate the quality of molecular representations generated by their proposed self-supervised pretrained models, and our findings can guide the community to develop better pretraining techniques to regularize the occurrence of ACs and SH.


Subjects
Anti-HIV Agents, Drug Discovery, Hydrolases, Learning, Quantitative Structure-Activity Relationship
3.
BMC Genomics ; 23(Suppl 6): 864, 2023 Nov 09.
Article in English | MEDLINE | ID: mdl-37946133

ABSTRACT

BACKGROUND: The rapid development of single-cell RNA sequencing (scRNA-seq) technology has led to huge amounts of scRNA-seq data, which greatly advance research in many biomedical fields involving tissue heterogeneity, pathogenesis of disease, drug resistance, etc. One major task in scRNA-seq data analysis is to cluster cells in terms of their expression characteristics. Up to now, a number of methods have been proposed to infer cell clusters, yet there is still much room to improve their performance. RESULTS: In this paper, we develop a new two-step clustering approach to effectively cluster scRNA-seq data, called TSC (Two-Step Clustering). Specifically, we divide all cells into two types: core cells (those possibly lying around the centers of clusters) and non-core cells (those located in the boundary areas of clusters). We first cluster the core cells by hierarchical clustering (the first step) and then assign the non-core cells to the corresponding nearest clusters (the second step). Extensive experiments on 12 real scRNA-seq datasets show that TSC outperforms the state-of-the-art methods. CONCLUSION: TSC is an effective clustering method due to its two-step clustering strategy, and it is a useful tool for scRNA-seq data analysis.


Subjects
Gene Expression Profiling, Single-Cell Analysis, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Cluster Analysis, Data Analysis, Algorithms
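
The second step described in this record, assigning non-core cells to their nearest clusters, can be sketched as follows. The sketch assumes the core cells have already been clustered and that "nearest" means smallest Euclidean distance to a cluster centroid. The abstract does not specify the distance or how core cells are identified, so the helper names `centroid` and `assign_non_core` and the toy coordinates are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def assign_non_core(core_clusters, non_core):
    """Label each non-core cell with the cluster whose centroid is closest.

    core_clusters: dict mapping cluster label -> list of core-cell vectors.
    non_core: list of non-core-cell vectors.
    Distance to the centroid is an assumed notion of "nearest cluster".
    """
    centroids = {label: centroid(pts) for label, pts in core_clusters.items()}
    return [
        min(centroids, key=lambda label: math.dist(cell, centroids[label]))
        for cell in non_core
    ]


# Toy 2-D expression embeddings (hypothetical values).
core_clusters = {
    "A": [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    "B": [[10.0, 10.0], [10.0, 11.0]],
}
non_core_cells = [[1.5, 1.5], [8.0, 9.0]]
labels = assign_non_core(core_clusters, non_core_cells)
print(labels)
```

Other linkage rules, such as distance to the nearest core cell, would fit the same interface by swapping the key function.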
4.
Genomics Proteomics Bioinformatics ; 21(5): 976-990, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37730114

ABSTRACT

A fundamental principle of biology is that proteins tend to form complexes to play important roles in the core functions of cells. For a complete understanding of human cellular functions, it is crucial to have a comprehensive atlas of human protein complexes. Unfortunately, we still lack such a comprehensive atlas of experimentally validated protein complexes, which prevents us from gaining a complete understanding of the compositions and functions of human protein complexes, as well as the underlying biological mechanisms. To fill this gap, we built the Human Protein Complexes Atlas (HPC-Atlas), which is, as far as we know, the most accurate and comprehensive atlas of human protein complexes available to date. We integrated two of the latest protein interaction networks and developed a novel computational method to identify nearly 9000 protein complexes, including many previously uncharacterized complexes. Compared with existing methods, our method achieved outstanding performance on both testing and independent datasets. Furthermore, with HPC-Atlas we identified 751 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-affected human protein complexes, and 456 multifunctional proteins that contain many potential moonlighting proteins. These results suggest that HPC-Atlas can serve not only as a computing framework to effectively identify biologically meaningful protein complexes by integrating multiple protein data sources, but also as a valuable resource for exploring new biological findings. The HPC-Atlas webserver is freely available at http://www.yulpan.top/HPC-Atlas.


Subjects
Computational Biology, Protein Interaction Mapping, Humans, Computational Biology/methods, Protein Interaction Mapping/methods, Protein Interaction Maps, Proteins/metabolism, Saccharomyces cerevisiae/metabolism, Algorithms
5.
Bioinformatics ; 39(8), 2023 Aug 01.
Article in English | MEDLINE | ID: mdl-37531266

ABSTRACT

MOTIVATION: Protein complexes are groups of polypeptide chains linked by non-covalent protein-protein interactions, which play important roles in biological systems and perform numerous functions, including DNA transcription, mRNA translation, and signal transduction. In the past decade, a number of computational methods have been developed to identify protein complexes from protein interaction networks by mining dense subnetworks or subgraphs. RESULTS: In this article, unlike existing works, we propose a novel approach for this task based on generative adversarial networks, called PCGAN (identifying Protein Complexes by GAN). With the help of some real complexes as training samples, our method can learn a model to generate new complexes from a protein interaction network. To effectively support model training and testing, we construct two more comprehensive and reliable protein interaction networks and a larger gold standard complex set by merging existing ones of the same organism (including human and yeast). Extensive comparison studies indicate that our method is superior to existing protein complex identification methods in terms of various performance metrics. Furthermore, functional enrichment analysis shows that the identified complexes are of high biological significance, which indicates that these generated protein complexes are very likely real complexes. AVAILABILITY AND IMPLEMENTATION: https://github.com/yul-pan/PCGAN.


Subjects
Protein Interaction Maps, Saccharomyces cerevisiae, Humans, Saccharomyces cerevisiae/metabolism, Signal Transduction, Protein Biosynthesis
6.
Brief Bioinform ; 24(5), 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37598424

ABSTRACT

Molecular property prediction (MPP) is a crucial and fundamental task for AI-aided drug discovery (AIDD). Recent studies have shown great promise in applying self-supervised learning (SSL) to produce molecular representations that cope with the widely recognized data scarcity problem in AIDD. As some specific substructures of molecules play important roles in determining molecular properties, molecular representations learned by deep learning models are expected to attach more importance to such substructures, implicitly or explicitly, to achieve better predictive performance. However, few SSL pre-trained models for MPP in the literature have focused on such substructures. To address this gap, this paper presents Chemistry-Aware Fragmentation for Effective MPP (CAFE-MPP for short) under the self-supervised contrastive learning framework. First, a novel fragment-based molecular graph (FMG) is designed to represent the topological relationship between the chemistry-aware substructures that constitute a molecule. Then, with well-designed hard negative pairs, a [...] is pre-trained at the fragment level by contrastive learning to extract representations for the nodes in FMGs. Finally, a Graphormer model is leveraged to produce molecular representations for MPP based on the embeddings of fragments. Experiments on 11 benchmark datasets show that the proposed CAFE-MPP method achieves state-of-the-art performance on 7 of the 11 datasets and the second-best performance on 3 datasets, compared with six notable self-supervised methods. Further investigations also demonstrate that CAFE-MPP can learn to embed molecules into representations that implicitly contain the information of fragments highly correlated with molecular properties, and can alleviate the over-smoothing problem of graph neural networks.


Subjects
Benchmarking, Drug Discovery, Computer Neural Networks, Supervised Machine Learning
7.
Bioinformatics ; 39(8), 2023 Aug 01.
Article in English | MEDLINE | ID: mdl-37505457

ABSTRACT

MOTIVATION: Contrastive learning has been widely used as a pretext task for self-supervised pre-trained molecular representation learning models in AI-aided drug design and discovery. However, existing methods that generate molecular views by noise-adding operations for contrastive learning may face the semantic inconsistency problem, which leads to false positive pairs and consequently poor prediction performance. RESULTS: To address this problem, we first propose a semantic-invariant view generation method that properly breaks molecular graphs into fragment pairs. Then, we develop a Fragment-based Semantic-Invariant Contrastive Learning (FraSICL) model based on this view generation method for molecular property prediction. The FraSICL model consists of two branches that generate representations of views for contrastive learning, while a multi-view fusion and an auxiliary similarity loss are introduced to make better use of the information contained in different fragment-pair views. Extensive experiments on various benchmark datasets show that, with the fewest pre-training samples, FraSICL can achieve state-of-the-art performance compared with major existing counterpart models. AVAILABILITY AND IMPLEMENTATION: The code is publicly available at https://github.com/ZiqiaoZhang/FraSICL.


Subjects
Benchmarking, Semantics, Molecular Models
8.
Bioinformatics ; 39(5), 2023 May 04.
Article in English | MEDLINE | ID: mdl-37079731

ABSTRACT

MOTIVATION: Predicting molecular properties is one of the fundamental problems in drug design and discovery. In recent years, self-supervised learning (SSL) has shown promising performance in image recognition, natural language processing, and single-cell data analysis. Contrastive learning (CL) is a typical SSL method used to learn the features of data so that the trained model can more effectively distinguish the data. One important issue in CL is how to select positive samples for each training example, which significantly impacts the performance of CL. RESULTS: In this article, we propose a new method for molecular property prediction (MPP) by Contrastive Learning with Attention-guided Positive-sample Selection (CLAPS). First, we generate positive samples for each training example based on an attention-guided selection scheme. Second, we employ a Transformer encoder to extract latent feature vectors and compute the contrastive loss, which aims to distinguish positive and negative sample pairs. Finally, we use the trained encoder for predicting molecular properties. Experiments on various benchmark datasets show that our approach outperforms the state-of-the-art (SOTA) methods in most cases. AVAILABILITY AND IMPLEMENTATION: The code is publicly available at https://github.com/wangjx22/CLAPS.


Subjects
Benchmarking, Research Design, Drug Design, Natural Language Processing, Single-Cell Analysis
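
The idea of a contrastive loss that separates positive from negative pairs, mentioned in this record, can be made concrete with a generic InfoNCE-style loss. This is a common formulation and is used here only as an assumption; the abstract does not state the exact loss in CLAPS. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor.

    The loss is small when the anchor is much closer to its positive
    than to any of the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
# Well-chosen positive: nearly parallel to the anchor.
loss_good = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Poorly chosen positive: orthogonal to the anchor, with a near-duplicate as a negative.
loss_bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_good, loss_bad)
```

The gap between the two losses illustrates why positive-sample selection matters: a poorly matched positive yields a large loss even when the encoder is reasonable.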
9.
Article in English | MEDLINE | ID: mdl-35139025

ABSTRACT

With the development of biomedical techniques in the past decades, causal gene identification has become one of the most promising applications in human genome-based business, which can help doctors evaluate the risk of certain genetic diseases and provide further treatment recommendations for potential patients. When no controlled experiments can be applied, machine learning techniques such as causal inference-based methods are generally used to identify causal genes. Unfortunately, most existing methods detect disease-related genes by ranking-based strategies or feature selection techniques, which generally return a superset of the real causal genes. There are also some causal inference-based methods that can identify a subset of the real causal genes from those supersets, but they are only able to return a few causal genes. This is contrary to our knowledge, as many results from controlled experiments have demonstrated that a given disease, especially cancer, is usually related to dozens or hundreds of genes. In this work, we present an effective approach for identifying causal genes from gene expression data by using a new search strategy based on non-linear regression-based independence tests, which greatly reduces the search space and simultaneously establishes the causal relationships from the candidate genes to the disease variable. Extensive experiments on real-world cancer datasets show that our method is superior to the existing causal inference-based methods in three aspects: 1) our method can identify dozens of causal genes, and 1/3 to 1/2 of the discovered causal genes can be verified by existing works as directly related to the corresponding disease; 2) the discovered causal genes are able to distinguish the status or disease subtype of the target patient; 3) most of the discovered causal genes are closely relevant to the disease variable.


Subjects
Algorithms, Neoplasms, Humans, Machine Learning, Neoplasms/genetics, Neoplasms/metabolism
10.
BMC Bioinformatics ; 23(Suppl 8): 339, 2022 Aug 16.
Article in English | MEDLINE | ID: mdl-35974329

ABSTRACT

BACKGROUND: Essential proteins are indispensable to the development and survival of cells. The identification of essential proteins is not only helpful for understanding the minimal requirements for cell survival, but also has practical significance in disease diagnosis, drug design and medical treatment. With the rapid accumulation of protein-protein interaction (PPI) data, computationally identifying essential proteins from protein-protein interaction networks (PINs) is becoming more and more popular. Up to now, a number of approaches for essential protein identification based on PINs have been developed. RESULTS: In this paper, we propose a new and effective approach called iMEPP to identify essential proteins from PINs by fusing multiple types of biological data and applying the influence maximization mechanism to the PINs. Concretely, we first integrate PPI data, gene expression data and Gene Ontology to construct weighted PINs, to alleviate the impact of the high false-positive rate in the raw PPI data. Then, we define the influence scores of nodes in PINs with both orthological data and PIN topological information. Finally, we develop an influence discount algorithm to identify essential proteins based on the influence maximization mechanism. CONCLUSIONS: We applied our method to identifying essential proteins from the Saccharomyces cerevisiae PIN. Experiments show that our iMEPP method outperforms the existing methods, which validates its effectiveness and advantage.


Subjects
Protein Interaction Maps, Saccharomyces cerevisiae Proteins, Algorithms, Computational Biology/methods, Gene Ontology, Protein Interaction Mapping/methods, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism, Saccharomyces cerevisiae Proteins/genetics, Saccharomyces cerevisiae Proteins/metabolism
11.
Bioinformatics ; 38(14): 3582-3589, 2022 Jul 11.
Article in English | MEDLINE | ID: mdl-35652721

ABSTRACT

MOTIVATION: Accurately predicting drug-target interaction (DTI) is a crucial step in drug discovery. Recently, deep learning techniques have been widely used for DTI prediction and have achieved significant performance improvements. One challenge in building deep learning models for DTI prediction is how to appropriately represent drugs and targets. Target distance maps and molecular graphs are low-dimensional and informative representations, which, however, have not been jointly used in DTI prediction. Another challenge is how to effectively model the mutual impact between drugs and targets. Although attention mechanisms have been used to capture the one-way impact of targets on drugs or vice versa, the mutual impact between drugs and targets, which is very important in predicting their interactions, has not yet been explored. RESULTS: Therefore, in this article we propose MINN-DTI, a new model for DTI prediction. MINN-DTI combines an interacting-transformer module (called Interformer) with an improved Communicative Message Passing Neural Network (CMPNN) (called Inter-CMPNN) to better capture the two-way impact between drugs and targets, which are represented by molecular graphs and distance maps, respectively. The proposed method obtains better performance than the state-of-the-art methods on three benchmark datasets: DUD-E, human and BindingDB. MINN-DTI also provides good interpretability by assigning larger weights to the amino acids and atoms that contribute more to the interactions between drugs and targets. AVAILABILITY AND IMPLEMENTATION: The data and code of this study are available at https://github.com/admislf/MINN-DTI.


Subjects
Computer Neural Networks, Proteins, Humans, Proteins/chemistry, Computer Simulation, Drug Development/methods, Drug Discovery/methods
12.
Brief Bioinform ; 23(2), 2022 Mar 10.
Article in English | MEDLINE | ID: mdl-35172334

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) permits researchers to study the complex mechanisms of cell heterogeneity and diversity. Unsupervised clustering is of central importance for the analysis of scRNA-seq data, as it can be used to identify putative cell types. However, due to noise, high dimensionality and pervasive dropout events, clustering analysis of scRNA-seq data remains a computational challenge. Here, we propose a new deep structural clustering method for scRNA-seq data, named scDSC, which integrates structural information into deep clustering of single cells. The proposed scDSC consists of a Zero-Inflated Negative Binomial (ZINB) model-based autoencoder, a graph neural network (GNN) module and a mutual-supervised module. To learn the data representation from the sparse and zero-inflated scRNA-seq data, we add a ZINB model to the basic autoencoder. The GNN module is introduced to capture the structural information among cells. By joining the ZINB-based autoencoder with the GNN module, the model transfers the data representation learned by the autoencoder to the corresponding GNN layer. Furthermore, we adopt a mutual-supervised strategy to unify these two different deep neural architectures and to guide the clustering task. Extensive experimental results on six real scRNA-seq datasets demonstrate that scDSC outperforms state-of-the-art methods in terms of clustering accuracy and scalability. Our method scDSC is implemented in Python using the PyTorch machine-learning library, and it is freely available at https://github.com/DHUDBlab/scDSC.


Subjects
Computer Neural Networks, Single-Cell Analysis, Cluster Analysis, Gene Expression Profiling, RNA-Seq, RNA Sequence Analysis/methods, Single-Cell Analysis/methods
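
The Zero-Inflated Negative Binomial (ZINB) model named in this record has a standard textbook form, and its negative log-likelihood for a single count can be sketched in a few lines. The parameterization below (mean `mu`, dispersion `theta`, dropout probability `pi`) is the usual one and is assumed here; it is not taken from the scDSC code, and the function names are hypothetical.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu > 0
    and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under ZINB.

    With probability pi the count is a structural zero (dropout);
    otherwise it is drawn from the negative binomial. Requires 0 <= pi < 1.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        prob = pi + (1.0 - pi) * nb
    else:
        prob = (1.0 - pi) * nb
    return -math.log(prob)


# Zero inflation makes an observed zero more likely, so its loss drops.
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.3))
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.0))
```

In an autoencoder setting, `mu`, `theta` and `pi` would typically be produced per gene by the decoder and this loss summed over all entries of the count matrix; for very large counts a log-space formulation would be numerically safer than the direct exponentiation used here.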
13.
IEEE Trans Cybern ; 52(3): 1785-1797, 2022 Mar.
Article in English | MEDLINE | ID: mdl-32525807

ABSTRACT

Link weight prediction is an important subject in network science and machine learning. Its applications to social network analysis, network modeling, and bioinformatics are ubiquitous. Although this subject has attracted considerable attention recently, the performance and interpretability of existing prediction models have not been well balanced. This article focuses on an unsupervised mixed strategy for link weight prediction. Here, the target attribute is the link weight, which represents the correlation or strength of the interaction between a pair of nodes. The input of the model is the weighted adjacency matrix without any preprocessing, as widely adopted in the existing models. Extensive observations on a large number of networks show that the new scheme is competitive with the state-of-the-art algorithms in terms of both root-mean-square error and Pearson correlation coefficient. Analytic and simulation results suggest that combining the weight consistency of the network and the link weight-associated latent factors of the nodes is a very effective way to solve the link weight prediction problem.


Subjects
Algorithms, Machine Learning, Computational Biology/methods, Computer Simulation
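
The two evaluation metrics named in this record, root-mean-square error and the Pearson correlation coefficient, follow their standard definitions and can be computed as below. The link weights in the example are made-up values for illustration, not results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def pearson(x, y):
    """Pearson correlation coefficient; both inputs must have non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical true and predicted link weights.
true_w = [1.0, 2.0, 3.0, 4.0]
pred_w = [1.5, 2.5, 3.5, 4.5]
print(rmse(true_w, pred_w), pearson(true_w, pred_w))
```

The example shows why both metrics are reported together: a constant offset leaves the correlation perfect while the RMSE is non-zero.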
14.
IEEE Trans Cybern ; 52(5): 3232-3243, 2022 May.
Article in English | MEDLINE | ID: mdl-32780709

ABSTRACT

This article addresses two important issues of causal inference in the high-dimensional situation. One is how to reduce redundant conditional independence (CI) tests, which heavily impact the efficiency and accuracy of existing constraint-based methods. The other is how to construct the true causal graph from a set of Markov equivalence classes returned by these methods. For the first issue, we design a recursive decomposition approach where the original data (a set of variables) are first decomposed into two small subsets, each of which is then recursively decomposed into two smaller subsets until none of these subsets can be decomposed further. Redundant CI tests can be reduced by inferring causalities from these subsets. The advantage of this decomposition scheme lies in two aspects: 1) it requires only low-order CI tests and 2) it does not violate d-separation. The complete causality can be reconstructed by merging all the partial results of the subsets. For the second issue, we employ regression-based CI tests to check CIs in linear non-Gaussian additive noise cases, which can identify more causal directions by [Formula: see text] (or [Formula: see text]). Consequently, causal direction learning is no longer limited by the number of returned V-structures and consistent propagation. Extensive experiments show that the proposed method can not only substantially reduce redundant CI tests but also effectively distinguish the equivalence classes.


Subjects
Causality
15.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3425-3434, 2022.
Article in English | MEDLINE | ID: mdl-34788219

ABSTRACT

Clustering analysis has been widely used in analyzing single-cell RNA-sequencing (scRNA-seq) data to study various biological problems at the cellular level. Although a number of scRNA-seq data clustering methods have been developed, most of them evaluate the similarity of pairwise cells while ignoring the global relationships among cells, and so sometimes cannot effectively capture the latent structure of cells. In this paper, we propose a new clustering method, SPARC, for scRNA-seq data. The most important feature of SPARC is a novel similarity metric that uses the sparse representation coefficients of each cell in terms of the other cells to measure the relationships among cells. In addition, we develop an outlier detection method to help parameter selection in SPARC. We compare SPARC with nine existing scRNA-seq data clustering methods on twelve real datasets. Experimental results show that SPARC achieves state-of-the-art performance. By further analyzing the cell similarity data derived from sparse representations, we find that SPARC is much more effective in mining high-quality clusters of scRNA-seq data than two traditional similarity metrics. In conclusion, this study provides a new way to effectively cluster scRNA-seq data and achieves more accurate clustering results than the state-of-the-art methods.


Subjects
Algorithms, Benchmarking, Cluster Analysis, RNA Sequence Analysis
16.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2512-2522, 2022.
Article in English | MEDLINE | ID: mdl-33630737

ABSTRACT

Cellular programs often exhibit strong heterogeneity and asynchrony in the timing of program execution. Single-cell RNA-seq technology has provided an unprecedented opportunity for characterizing these cellular processes by simultaneously quantifying many parameters at single-cell resolution. Robust trajectory inference is a critical step in the analysis of dynamic temporal gene expression, which can shed light on the mechanisms of normal development and diseases. Here, we present TiC2D, a novel algorithm for cell trajectory inference from single-cell RNA-seq data, which adopts a consensus clustering strategy to precisely cluster cells. To evaluate the power of TiC2D, we compare it with three state-of-the-art methods on four independent single-cell RNA-seq datasets. The results show that TiC2D can accurately infer developmental trajectories from single-cell transcriptomes. Furthermore, the reconstructed trajectories enable us to identify key genes involved in cell fate determination and to obtain new insights into their roles at different developmental stages.


Subjects
Algorithms, Single-Cell Analysis, Cluster Analysis, Consensus, Gene Expression Profiling/methods, RNA-Seq, RNA Sequence Analysis/methods, Single-Cell Analysis/methods
17.
BMC Bioinformatics ; 22(Suppl 6): 130, 2021 Jun 02.
Article in English | MEDLINE | ID: mdl-34078287

ABSTRACT

BACKGROUND: The rapid development of single-cell RNA sequencing (scRNA-seq) enables the exploration of cell heterogeneity, which is usually done by scRNA-seq data clustering. The essence of scRNA-seq data clustering is to group cells by measuring the similarities among genes/transcripts of cells. The selection of features for cell similarity evaluation is of great importance, as it significantly impacts clustering effectiveness and efficiency. RESULTS: In this paper, we propose a novel method called CaFew to select genes based on cluster-aware feature weighting. By optimizing the clustering objective function, CaFew obtains a feature weight matrix, which is further used for feature selection. Genes that have large weights in at least one cluster, or whose weights vary greatly across clusters, are selected. Experiments on 8 real scRNA-seq datasets show that CaFew can clearly improve the clustering performance of existing scRNA-seq data clustering methods. In particular, the combination of CaFew with SC3 achieves state-of-the-art performance. Furthermore, CaFew also benefits the visualization of scRNA-seq data. CONCLUSION: CaFew is an effective scRNA-seq data clustering method due to its gene selection mechanism based on cluster-aware feature weighting, and it is a useful tool for scRNA-seq data analysis.


Subjects
Small Cytoplasmic RNA, Single-Cell Analysis, Algorithms, Cluster Analysis, Gene Expression Profiling, RNA Sequence Analysis
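
The gene selection rule in this record (keep genes with a large weight in at least one cluster, or with weights that vary greatly across clusters) can be sketched given an already-computed weight matrix. The sketch assumes "large" and "vary greatly" are expressed as thresholds on the maximum weight and on the population variance. These thresholds, the function name `select_genes`, and the toy weights are hypothetical, and the optimization that produces the weight matrix is not shown.

```python
from statistics import pvariance


def select_genes(weights, min_weight, min_variance):
    """Return genes whose per-cluster weights pass either criterion.

    weights: dict mapping gene name -> list of weights, one per cluster.
    A gene is kept if its largest weight is >= min_weight, or if the
    variance of its weights across clusters is >= min_variance.
    """
    selected = []
    for gene, per_cluster in weights.items():
        large_somewhere = max(per_cluster) >= min_weight
        varies = pvariance(per_cluster) >= min_variance
        if large_somewhere or varies:
            selected.append(gene)
    return selected


# Toy weight matrix for three genes over three clusters (made-up values).
weights = {
    "gene_1": [0.90, 0.05, 0.05],  # dominant in one cluster
    "gene_2": [0.10, 0.10, 0.10],  # uniformly low, uninformative
    "gene_3": [0.40, 0.00, 0.20],  # moderate, but varies across clusters
}
chosen = select_genes(weights, min_weight=0.8, min_variance=0.02)
print(chosen)
```

The selected genes would then be passed to a downstream clustering method in place of the full gene set.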
18.
Bioinformatics ; 37(18): 2981-2987, 2021 Sep 29.
Article in English | MEDLINE | ID: mdl-33769437

ABSTRACT

MOTIVATION: Molecular property prediction has been a hot topic in recent years. Existing graph-based models ignore the hierarchical structures of molecules. According to the knowledge of chemistry and pharmacy, the functional groups of molecules are closely related to their physicochemical properties and binding affinities. So, it should be helpful to represent molecular graphs by fragments that contain functional groups for molecular property prediction. RESULTS: In this article, to boost the performance of molecular property prediction, we first propose a definition of molecular graph fragments that may be or contain functional groups, which are relevant to molecular properties, and then develop a fragment-oriented multi-scale graph attention network for molecular property prediction, called FraGAT. Experiments on several widely used benchmarks are conducted to evaluate FraGAT. Experimental results show that FraGAT achieves state-of-the-art predictive performance in most cases. Furthermore, our case studies show that when the fragments used to represent the molecular graphs contain functional groups, the model can make better predictions. This conforms to our expectation and demonstrates the interpretability of the proposed model. AVAILABILITY AND IMPLEMENTATION: The code and data underlying this work are available on GitHub at https://github.com/ZiqiaoZhang/FraGAT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

19.
J Proteome Res ; 20(1): 1079-1086, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-33338382

ABSTRACT

Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. To facilitate the evaluation and correction of batch effects, here we present BatchServer, an open-source, user-friendly, interactive graphical web platform based on R/Shiny for batch effects analysis. In BatchServer, we introduce autoComBat, a modified version of ComBat, which is the most widely adopted tool for batch effect correction. BatchServer uses PVCA (Principal Variance Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) for evaluation and visualization of batch effects. We demonstrate its applications in multiple proteomic and transcriptomic data sets. BatchServer is provided at https://lifeinfor.shinyapps.io/batchserver/ as a web server. The source code is freely available at https://github.com/guomics-lab/batch_server.


Subjects
Computational Biology, Software
20.
IEEE Trans Image Process ; 30: 822-837, 2021.
Article in English | MEDLINE | ID: mdl-33226946

ABSTRACT

Currently, video text spotting tasks usually follow a four-stage pipeline: detecting text regions in individual images, recognizing localized text regions frame by frame, tracking text streams, and post-processing to generate final results. However, such pipelines may suffer from huge computational cost as well as sub-optimal results due to the interference of low-quality text and the non-trainable pipeline strategy. In this article, we propose a fast and robust end-to-end video text spotting framework named FREE, which recognizes the localized text stream only once instead of performing frame-wise recognition. Specifically, FREE first employs a well-designed spatial-temporal detector that learns text locations among video frames. Then a novel text recommender is developed to select the highest-quality text from text streams for recognition. Here, the recommender is implemented by assembling text tracking, quality scoring and recognition into a trainable module. It not only avoids interference from low-quality text but also dramatically speeds up video text spotting. FREE unites the detector and recommender into a single framework and helps achieve global optimization. In addition, to support the video text spotting community, we collect a large-scale video text dataset containing 100 videos from 21 real-life scenarios. Extensive experiments on public benchmarks show that our method greatly speeds up the text spotting process and also achieves state-of-the-art performance.
